Project Introduction

In this project, I explore and analyze red wine quality. I will use several data analysis techniques in R to find insights in one or more variables.

library(ggplot2)
library(grid)
library(gridExtra)
library(GGally)
library(dplyr)
library(tidyr)
# Load the Data
dt <- read.csv('wineQualityReds.csv')

Univariate Plots Section

Let’s take a look at some summary statistics on the dataset first.

#summary statistic
str(dt)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
summary(dt)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The dataset has 13 variables and 1599 rows. The variable X appears to be the row index. Quality is the response (“Y”) variable we are interested in, and the remaining variables except X are the explanatory (“X”) variables whose influence on quality we are going to analyze. Quality ranges from 3 to 8, with a mean of 5.6 and a median of 6.

Then I looked at the distribution plots of all 12 variables.

From the histograms, we can see that most variables are right skewed (their means sit above their medians in the summary above), while pH and density are approximately normally distributed. Among the right-skewed variables, residual sugar and chlorides have especially long tails.
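A minimal sketch of how faceted histograms like these could be drawn with ggplot2 and tidyr, both loaded at the top of this report. The toy data frame `wine` holds only the first ten rows of three columns, copied from the str(dt) output, so the sketch runs without the CSV. The names `wine`, `wine_long` and `p_hist` are illustrative and not taken from the original analysis.

```r
library(ggplot2)
library(tidyr)

# First ten rows of three columns, copied from the str(dt) output
# (an illustrative subset, not the full dataset).
wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.076,
                     0.075, 0.069, 0.065, 0.073, 0.071),
  pH             = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Reshape to long format: one row per (variable, value) pair.
# With no columns named, gather() collects every column.
wine_long <- gather(wine, key = "variable", value = "value")

# One histogram panel per variable, each with its own axis scales.
p_hist <- ggplot(wine_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")
```

Calling print(p_hist) draws the panels. On the full data, the same two steps would apply to dt after dropping the X column.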

Then, let’s take a look at boxplots.

From the boxplots, we can see that most variables have outliers, especially residual sugar and chlorides. We’ll decide whether we need to remove outliers later in the analysis.
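To put a number on what the boxplots show, one option is to count the points that a boxplot would draw as outliers. The sketch below uses base R's boxplot.stats(), which flags values more than 1.5 times the interquartile range beyond the hinges. It runs on the first ten rows from the str(dt) output only, so the counts are illustrative and say nothing about the full dataset. The names `wine` and `outlier_counts` are hypothetical.

```r
# First ten rows of two columns, copied from the str(dt) output
# (an illustrative subset, not the full dataset).
wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.076,
                     0.075, 0.069, 0.065, 0.073, 0.071)
)

# boxplot.stats()$out holds the values lying more than 1.5 * IQR
# beyond the lower or upper hinge, i.e. the points a boxplot shows as dots.
outlier_counts <- sapply(wine, function(x) length(boxplot.stats(x)$out))
```

On the full data, applying the same function to every column of dt would give a per-variable outlier count to compare against the plots.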

Univariate Analysis

What is the structure of your dataset?

The shape of the dataset is (1599, 13): there are 1599 wine records and 13 variables (with X being the index of the dataset).

Of the 12 wine variables, the first 11 are physicochemical measurements on wine samples, and the 12th, quality, is a 10-point scale output based on sensory data from at least three wine experts.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. From the Univariate Plots Section, we see that its distribution is nearly normal, with most observations in the 5-6 range.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Although all variables could potentially affect wine quality, some high-level research suggests that acidity is a major factor that influences wine. So I think fixed acidity, volatile acidity, citric acid and pH are significant.

Did you create any new variables from existing variables in the dataset?

I created two variables: quality_score and acidity. Quality_score groups quality into three buckets: poor, mid and good. Because most wines are rated 5 or 6, I treat 5 and 6 as mid level; anything below 5 is poor and anything above 6 is good. Acidity is the sum of fixed acidity, volatile acidity and citric acid, which could be a significant feature.
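The two derived variables can be sketched as follows. This assumes quality_score is built with cut() as an ordered-style factor; the original code is not shown, and the correlation output later in the report suggests a numeric coding may have been used instead. The toy data frame `wine` reuses the first ten rows from the str(dt) output so the example runs without the CSV.

```r
# First ten rows of the relevant columns, copied from the str(dt) output
# (an illustrative subset, not the full dataset).
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Bucket quality: below 5 is poor, 5 or 6 is mid, above 6 is good.
# cut() uses right-closed intervals by default: (-Inf, 4], (4, 6], (6, Inf).
wine$quality_score <- cut(wine$quality,
                          breaks = c(-Inf, 4, 6, Inf),
                          labels = c("poor", "mid", "good"))

# Combined acidity: the sum of the three acid measurements.
wine$acidity <- wine$fixed.acidity + wine$volatile.acidity + wine$citric.acid
```

Keeping the buckets as a factor with the levels in order (poor, mid, good) means plots group and order them sensibly without extra work.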

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution of citric acid is unusual compared with fixed acidity and volatile acidity. The latter two are roughly bell shaped, but citric acid looks more like an exponential distribution. Citric acid also has a large number of zero values, which could reflect incomplete or unavailable data.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 132 rows containing non-finite values (stat_bin).
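Warnings like the ones above are what ggplot2 emits when a log-scaled axis meets zeros, since the logarithm of zero is negative infinity. A small base R check, sketched here on the first ten citric.acid values from the str(dt) output, separates true missing values (NA) from zeros. The variable names are illustrative.

```r
# First ten citric.acid values, copied from the str(dt) output
# (an illustrative subset, not the full dataset).
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Zeros and missing values are different things: count each separately.
n_zero    <- sum(citric_acid == 0)
n_missing <- sum(is.na(citric_acid))

# A log transform maps every zero to -Inf, which a histogram has to drop.
log_citric  <- log10(citric_acid)
n_nonfinite <- sum(!is.finite(log_citric))
```

In this subset every non-finite value after the transform comes from a zero, not from an NA.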

In general, the dataset is tidy and no other cleaning is needed.

Bivariate Plots Section

For bivariate analysis, I’ll start by creating 11 box plots to find relationships between quality and each feature. I use quality score instead of quality because quality score has fewer groups, which helps us see the relationships more clearly in the plots.

From the above boxplots, we can see that fixed acidity, volatile acidity and citric acid all have a relationship with quality score. Fixed acidity and citric acid have a positive relationship, while volatile acidity has a negative one. The derived acidity variable has only a slightly positive relationship, so I don’t think it is better than the three separate variables. In addition, sulphates and alcohol are positively correlated with wine quality.
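A sketch of one such box plot, assuming quality_score is a factor built with cut() as described in the Univariate Analysis. It uses the first ten rows from the str(dt) output, which contain no wines rated below 5, so the poor group is empty here. The names `wine`, `p_box` and `alcohol_median` are illustrative.

```r
library(ggplot2)

# First ten rows of two columns, copied from the str(dt) output
# (an illustrative subset, not the full dataset).
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine$quality_score <- cut(wine$quality,
                          breaks = c(-Inf, 4, 6, Inf),
                          labels = c("poor", "mid", "good"))

# One box per quality bucket; the box midline is the group median.
p_box <- ggplot(wine, aes(x = quality_score, y = alcohol)) +
  geom_boxplot()

# The same comparison as numbers: median alcohol per bucket.
# Buckets with no rows come back as NA.
alcohol_median <- tapply(wine$alcohol, wine$quality_score, median)
```

Swapping alcohol for any other column gives the corresponding plot; ten rows are far too few to draw conclusions from, so the numbers only show the mechanics.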

To confirm what we saw in the plots, I calculated the correlation between quality and each variable.

##                    X        fixed.acidity     volatile.acidity 
##           0.06645261           0.12405165          -0.39055778 
##          citric.acid       residual.sugar            chlorides 
##           0.22637251           0.01373164          -0.12890656 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.05065606          -0.18510029          -0.17491923 
##                   pH            sulphates              alcohol 
##          -0.05773139           0.25139708           0.47616632 
##              quality              acidity        quality_score 
##           1.00000000           0.10375373           0.81236704

We can see that:

  1. alcohol has the strongest correlation with quality, followed by volatile acidity.
  2. fixed acidity and citric acid have a positive correlation with quality, while volatile acidity has a negative correlation.
  3. free.sulfur.dioxide and pH have very low correlations (magnitude of about 0.05).
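The per-variable coefficients printed above have the shape that cor() returns when every column is correlated against quality at once. A minimal sketch, assuming Pearson correlation (the cor() default) and using the first ten rows from the str(dt) output, so the values it produces will not match the full-data output. The names `wine` and `cor_with_quality` are illustrative.

```r
# First ten rows of four columns, copied from the str(dt) output
# (an illustrative subset, not the full dataset).
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# cor(x, y) with a data frame x and a vector y returns a one-column matrix
# with one row per column of x; taking column 1 gives a named vector.
cor_with_quality <- cor(wine, wine$quality)[, 1]

# Rank the features by absolute correlation, strongest first.
ranked <- sort(abs(cor_with_quality), decreasing = TRUE)
```

Quality correlates perfectly with itself, so it always comes first in the ranking and can be dropped before reading off the top features.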

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the boxplots, we can see that fixed acidity, volatile acidity and citric acid all have a relationship with quality score. Fixed acidity and citric acid have a positive relationship, while volatile acidity has a negative one. The derived acidity variable has only a slightly positive relationship, so I don’t think it is better than the three separate variables. In addition, sulphates and alcohol are positively correlated with wine quality. These findings are supported by the correlation coefficients.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  1. acidity and pH.

##        cor 
## -0.6829782

##       cor 
## 0.2349373

##        cor 
## -0.5419041

We can see that fixed acidity and citric acid are negatively correlated with pH, while volatile acidity is slightly positively correlated with pH.
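The `cor` label on the three outputs above matches how cor.test() names its estimate, so they may have come from something like the sketch below. It runs on the first ten rows from the str(dt) output, so its values will differ from the full-data coefficients. The names `wine`, `acid_cols` and `ph_cor` are illustrative.

```r
# First ten rows of four columns, copied from the str(dt) output
# (an illustrative subset, not the full dataset).
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

acid_cols <- c("fixed.acidity", "volatile.acidity", "citric.acid")

# cor.test() runs a Pearson test by default; $estimate is the coefficient,
# returned as a length-one vector named "cor". unname() drops that label
# so the result is named by acid type only.
ph_cor <- sapply(acid_cols, function(col) {
  unname(cor.test(wine[[col]], wine$pH)$estimate)
})
```

cor.test() also returns a p-value and confidence interval, which plain cor() does not.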

What was the strongest relationship you found?

Top 5 features most correlated with quality:

  1. alcohol: 0.476
  2. volatile acidity: -0.391
  3. sulphates: 0.251
  4. citric acid: 0.226
  5. total.sulfur.dioxide: -0.185

Multivariate Plots Section

In this multivariate plot section, I’ll analyze whether there are any interactions between the above 5 features.

For alcohol and volatile.acidity, we can see a clear separation between poor wine (high volatile acidity and low alcohol content) and good wine (low volatile acidity and high alcohol content).

For alcohol and citric acid, I didn’t find a clear interaction.

For alcohol and sulphates, we found a clear separation between poor wine (low sulphates and low alcohol content) and good wine (high sulphates and high alcohol content).

For alcohol and total sulfur dioxide, I didn’t find a clear interaction.

For citric.acid and volatile.acidity, the separation is not very clear, but we can still see a difference between poor wine (low citric.acid and high volatile.acidity) and good wine (high citric.acid and low volatile.acidity).

For total.sulfur.dioxide and volatile.acidity, I didn’t find a clear pattern of difference.
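Each of these comparisons can be drawn as a scatter plot of two features with points coloured by quality bucket. A minimal ggplot2 sketch for alcohol against volatile.acidity, assuming quality_score is a factor built with cut(). It uses the first ten rows from the str(dt) output, and the names `wine` and `p_scatter` are illustrative.

```r
library(ggplot2)

# First ten rows of three columns, copied from the str(dt) output
# (an illustrative subset, not the full dataset).
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine$quality_score <- cut(wine$quality,
                          breaks = c(-Inf, 4, 6, Inf),
                          labels = c("poor", "mid", "good"))

# Two features on the axes, quality bucket as colour. Semi-transparent
# points help when many observations overlap.
p_scatter <- ggplot(wine, aes(x = alcohol, y = volatile.acidity,
                              colour = quality_score)) +
  geom_point(alpha = 0.6) +
  labs(x = "alcohol", y = "volatile.acidity", colour = "quality score")
```

On the full data, dropping the mid bucket before plotting is one way to make the contrast between poor and good wines easier to see.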

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From the multivariate plots, I found that alcohol with volatile.acidity and alcohol with sulphates show a strong interaction effect. citric.acid with volatile.acidity shows some interaction, but it is not very strong. Alcohol with citric acid, alcohol with total sulfur dioxide, and total.sulfur.dioxide with volatile.acidity show no interaction effect: although each feature is among those most correlated with quality, they don’t strengthen each other.

Were there any interesting or surprising interactions between features?

One interesting interaction is between citric.acid and volatile.acidity. They show some degree of interaction, whereas for alcohol and citric acid we could not find an interaction effect, even though alcohol is a more significant feature than volatile.acidity.


Final Plots and Summary

From the final plots below, we can see that low volatile acidity, high alcohol and high sulphates are associated with good wines.

This plot tells us that good wine has more alcohol and more sulphates. Because there’s a clear distinction between poor and good wines in the plot, I’d say that alcohol and sulphates are two important factors influencing wine quality.

This plot tells us that good wine has more alcohol and less volatile acidity. Because there’s a clear distinction between poor and good wines in the plot, I’d say that alcohol and volatile acidity are two important factors influencing wine quality.

This plot tells us that good wine has more alcohol and more citric acid. I chose this plot not only because I found a clear difference in it, but also because the impact of citric acid is opposite to that of volatile acidity (one is positive, the other negative), which is interesting.

In conclusion, these three scatter plots tell us that good wine has more alcohol, more sulphates, more citric acid and less volatile acidity. Note that citric acid and volatile acidity have opposite effects on wine quality.


Reflection

Exploratory data analysis proved to be very effective in understanding relationships within the red wine quality dataset. The project shows a systematic way of analyzing and visualizing a dataset. It starts with univariate analysis, to understand the dataset and the distribution of each variable. Although this step seemed less useful in this project, it can be very helpful when a dataset has data quality issues. The bivariate analysis then starts to bring out insights, helping to find the features that most influence wine quality. I found the top 5 features most correlated with wine quality:

  1. alcohol: 0.476
  2. volatile acidity: -0.391
  3. sulphates: 0.251
  4. citric acid: 0.226
  5. total.sulfur.dioxide: -0.185

This guided my next step, the multivariate analysis. This third step is the most important one, as it helps find the key insights of the dataset by revealing the interaction effects between variables. I found that alcohol with volatile.acidity and alcohol with sulphates show a strong interaction effect. Finally, from the final plots, I concluded that good wine has more alcohol, more sulphates and less volatile acidity.